-
Notifications
You must be signed in to change notification settings - Fork 4.2k
Add spec for the data section string literals feature #76139
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add spec for the data section string literals feature #76139
Conversation
utf8-string-literal-encoding feature flag|
@jjonescz Consider updating PR's title. The "feature flag" is just part of the feature, it is not the feature. |
|
Done with review pass (commit 10) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM (commit 11)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM Thanks (iteration 11)
|
Would be interesting to extend this to encoding constant strings being passed to The protobuf compiler for example generates such strings. And BTW if I replace |
Could you please elaborate what would be an advantage for doing this? Also, how does this align with the specific goal that we are trying to achieve here, which is not a goal of supporting different encodings for string literals? |
|
|
||
| For every unique string literal, a unique internal static class is generated which: | ||
| - has name composed of `<S>` followed by a hex-encoded XXH128 hash of the string | ||
| (collisions [should not happen][xxh128] with XXH128 and so they aren't currently detected or reported, the behavior in that case is undefined), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this a security bug that the collision is undetected?
The collisions can happen with some very small but non-zero probability. If a program is unlucky to hit the collision, the string literals will be messed up and the behavior of the program will be impacted in interesting ways. This can be used by adversaries to contribute an innocently looking string to an open-source project where the string is carefully crafted to cause a collision with some other critical string.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are a few places in the compiler where we use hashes and collisions could lead to program behavior changes. These are part of our security reviews in the past and this will be included in our next one. Thus far nothing has risen to the level of forcing extra checking. There is an increase in non-cryptographic hashing support though and I will make sure that is included in the next security review. @RikkiGibson
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, XXH128 is non-cryptographic hash, so one should assume that it is easy to find collisions. If we are going to keep the current implementation for this feature, I would be curious to know how we are justifying that it is ok.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Understand the concern here. But you can create similar collisions today at much lower level places in the compiler by creating collisions with file contents. The use of non-crypto hashes was approved at that layer too (was in fact suggested by security team). The general reasoning is that once you can control code being submitted to the project you have lots of avenues for subverting expectations.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The general reasoning is that once you can control code being submitted to the project you have lots of avenues for subverting expectations.
In the open-source world, it is important that the project maintainers are able to review the changes with confidence.
For example, let's say that this feature gets enabled for some part of OpenTelemetry project and somebody submits a PR that includes a code like this:
void DiagnosticThatIsTurnedOffByDefault()
{
// Random unique string to help us identify blah
DoStuff("ruroiodffdshofps09rw403492049358302849052peldfjskdfshgfdshiofofjsekljhkdkjafhkds3");
}
How can the maintainer of the OpenTelemetry project tell that this change is actually planting a backdoor into an unrelated code due to a hash collision?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
xxHash128 is also used to identify source documents in the IDE currently, as well as to identify interceptable calls in source code. We consider it collision-safe for these purposes. See also the Examples at the bottom of this doc: https://web.archive.org/web/20240223124645/https://fossies.org/linux/xxHash/tests/collisions/README.md
@CyrusNajmabadi in case you have further thoughts.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
xxHash128 is also used to identify source documents in the IDE currently
I would not be worried about the IDE case from the security angle. If somebody crafts a file name with a hash collision and it makes the IDE to misbehave or crash, it is not a big deal. (You may be scratching your head for a while if you are tasked with investigating the crash report.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FWIW, I think it should be straightforward to add collision detection, it just didn't seem straightforward to test that, but I can look into that more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually I think hash collisions might not be a problem - #77061 (comment)
|
|
||
| ### Runtime support | ||
|
|
||
| We could emit the strings to data fields similarly but we would not synthesize the `string` fields and static constructors. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should make this happen if you plan to advertise this feature for broader use, and not just as a workaround for the few folks who run into the limit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The plan is to use the current state of implementation, compiler only trick, to get feedback on the scenarios. We did a number of benchmarks on the work and didn't see meaningful differences that justified preemptively doing runtime work. Decided that we would ship it out as experimental, get real world feedback from teams like Bing, and then let that drive what we wanted the final state of the feature to be before we remove experimental lable
Implementation: #76036
Link to rendered markdown: https://github.com/jjonescz/roslyn/blob/DataSectionStringLiterals-spec/docs/features/string-literals-data-section.md